ABSTRACT 

This paper considers the question of bias in group 
administered academic achievement tests, bias which is inherent in 
the instruments themselves. A body of data on the test performance 
of three disadvantaged minority groups (northern, urban black; 
southern, rural black; and southwestern Mexican-American) as 
tryout samples in contrast to white, advantaged groups in the same 
regions, was analyzed using five different general methods for 
examining tests for bias. In an item tryout, a set of items is 
administered to a sample of the relevant population and the results 
are then examined item by item in an effort to pick the more 
effective items. The first method is an item selection routine using 
the point biserial correlation for each item as the criterion. The 
second method, group by score interactions, involves dividing the 
tryout group into, say, fourths, based on quartiles, and examining 
the proportion of the cases making each possible response in each of 
these levels. The third method involves plotting item difficulties so 
as to locate aberrant items. The fourth method involves estimating 
and plotting item characteristic curves separately for each group and 
comparing the plots. The fifth method comprises various intergroup 
factor analytic approaches. (Author/JM) 
In this paper we wish to discuss our explorations of methods of assessing 
test bias. We hope that this information can be used to construct 
less biased tests as well as contribute to an understanding of the nature 
and sources of bias. 

A biased test is popularly understood to be a test which is unfair to 
identifiable subgroups of the general population in which it is being used. 
Although many people seem to believe the matter is simple, little is actu- 
ally known about the nature of bias in tests and even the most widely ac- 
cepted propositions badly need verification. This is partly because this 
verification is deceptively difficult to obtain for many kinds and uses of 
tests. Sometimes quite indirect methods are needed. Williams, for example, 
is trying to show that the classic IQ tests favor whites and are unfair 
to blacks by building a similar test that favors blacks and is unfair to 
whites. A second source of difficulty lies in the ambiguities in the popular 
definition of bias given above. 

Therefore, before proceeding we wish to make a few preliminary points. 
First, bias is presumably a potential attribute of all kinds of tests; to 
keep matters simple we shall limit our discussion to typical achievement 
and/or ability tests. Second, we will call a test any collection of items 
intended to measure a single unitary domain, not collections of test bat- 
teries and other composite collections. Thus when we say "test" many would 
say "subtest." This we hope will also simplify matters. Third, we acknowledge 
that the number and kinds of subgroups against which a test may be 
biased are nearly endless; again to keep matters simple we will limit our 
discussion to the kinds of subgroups used in the studies being reported here 
today. In our work we have used groups defined by four kinds of descriptors: 
a) ethnic identification (black, Mexican-American, white), b) type of 
housing area (urban, suburban, rural), c) economic status (middle, low), 
d) region (Northern, Southern, Southwestern). In our data these categories 
are confounded. 

Finally, we want to establish clearly that the concept of bias as un- 
fairness can be equated directly and more usefully to the proposition that 
bias occurs when a test measures different things for different sets of in- 
dividuals. This definition does not conflict with those used by others for 
test bias in the absence of external validity criteria (e.g., Angoff and 
Ford 1971, Cleary and Hilton 1968, Potthoff 1966), and avoids some, but not 
all, of the pitfalls that arise in discussions of fairness. On the face of 
it an unfair test must systematically yield scores for some identifiable 
groups that are improperly high or low; unsystematic error is lack of 
reliability, not bias, although consistently different amounts of error between 
groups can be considered a special kind of bias. Bias can occur only when 
two or more groups obtain scores on a test such that the scores of at least 
one group are typically less fair than the scores of at least one other 
group. The question then arises: how can that be? One possible answer is 
that the test has been applied unfairly or improperly to one group but not 
the other. Clearly this sort of thing happens and some believe that is the 
sum total of bias. Certainly it is a serious problem. However it is not 
our topic here because that source of bias is not inherent in the instrument. 
The question here concerns bias built into tests. 

Are there any ways that bias can occur that are not a consequence of 
biased administration? The answer appears to be yes if, and only if, the 
test measures different things for at least two otherwise distinguishable 
groups such as the ethnic and cultural groups we are most concerned with 
here. If a test is properly administered under appropriate conditions and 
yet is biased, the most reasonable explanation is that the test is measuring 
something different for the different groups; otherwise the results would 
be fair. This can occur in several ways. Table 1 indicates briefly a scheme 
for categorizing these ways. 

To determine bias it is not always necessary to consider these different 
types of bias. This is particularly the case when there are unambiguous 
external criteria of validity such as in the studies being reported by Caylor 
here today. But whenever such criteria are lacking (e.g., in scholastic 
achievement tests) or when the criterion measures may themselves be biased 
(e.g., the Stanford-Binet used as the criterion of group ability tests) then 
these categories suggest ways of coming to understand the nature of the bias. 
Obviously understanding bias is the ultimate goal and is necessary if the 
bias is to be eliminated. In any case the scheme suggests ways of looking 
for bias. 

Type A is simply unequal reliability; the variance attributable to random 
error is substantially larger for one group than another. Such a test 
would yield more inaccurate scores for one group than for the other although 
the direction of error, unlike the other types of bias, varies randomly 
within the group. Since one can determine from item tryouts just how much 
each item contributes to reliability, i.e., to the size of the KR 20 given 
the remaining items in a set, control of this sort of bias should ordinarily 
be relatively simple. 
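
The item-by-item reliability check mentioned above can be sketched in a few lines. What follows is only an illustration under stated assumptions, not the routine used in our analyses: responses are assumed to be scored 0/1, one list per examinee, and the function names kr20 and kr20_if_item_deleted are hypothetical. The sketch computes KR 20 for a group and then recomputes it with each item left out in turn.

```python
from statistics import pvariance


def kr20(responses):
    """KR 20 for a list of examinees, each a list of 0/1 item scores."""
    k = len(responses[0])
    if k < 2:
        raise ValueError("KR 20 needs at least two items")
    n = len(responses)
    totals = [sum(r) for r in responses]
    var_total = pvariance(totals)  # population variance of total scores
    if var_total == 0:
        raise ValueError("total scores have zero variance")
    # Sum of p*q over items, where p is the proportion answering correctly.
    sum_pq = 0.0
    for j in range(k):
        p = sum(r[j] for r in responses) / n
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)


def kr20_if_item_deleted(responses):
    """KR 20 of the remaining items when each item is dropped in turn."""
    k = len(responses[0])
    return [kr20([r[:j] + r[j + 1:] for r in responses]) for j in range(k)]
```

Run separately on each tryout group, the second function shows which items raise or lower KR 20 in one group but not in another; such items would be the natural candidates for review under Type A.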

Just how common is this sort of bias? We don't know. Our data suggest 
the amount is very little although it may be a common phenomenon. It 
does seem unlikely that a test would be biased in only this way. We consider 
it unlikely partly because no ready explanation for such a phenomenon, 
if it exists, comes to mind. Note that for simplicity of illustration the 
other types of bias are illustrated as though the amount of error would be 
the same for both groups; this also seems unlikely. Unequal reliability 
would probably accompany the four other types, since in each case, one 
really has different tests for the two groups. 

Table 1 

Types of Bias Illustrated by Hypothetical Proportions 
of Variance Attributable to Different Sources 

                  Sources of Variance (%) 
Type  Group  Error  Factor 1  Factor 2  Factor 3  Description 

A     I       10       90                         Unequal reliability 
      II      25       75 

B     I       10       30        10        50     Same factors in different 
      II      10        5        40        45     proportion 

C     I       10       30        60               Additional factor(s) for 
      II      10       30        30        30     one group 

D     I       10       30        60               Some common factors, some 
      II      10       30                  60     factors unique to each group 

E     I       10                 90               Nothing in common 
      II      10                           90 

The second type would appear on rational grounds to be highly likely 
while the remaining three can be considered less probable; for a set of 
test items to actually engage a different set of traits in different groups 
is more difficult to imagine than engaging the same traits in different 
proportions. Indeed many of the explanations of bias commonly offered fit 
this latter situation. For example Williams (1970) has suggested that typical 
reading comprehension tests measure more vocabulary among blacks than 
among whites and offers an example of a passage written for blacks which 
would reverse this. Since it is well known that the paragraph-type reading 
comprehension tests also measure general background (some people know 
more about the content of the passages than others), at least three interrelated 
factors probably enter into scores on such a test. Simple methods 
of detecting this kind of bias in test construction are not very obvious 
nor do easily executed corrective measures suggest themselves. 

Similar problems arise with each of the remaining types. To be sure, 
the nature of the bias that would occur given one or another of the types 
of bias we have listed would be quite different. Nevertheless most of the 
practical ways of examining tests, especially during construction, do not 
distinguish among them. However, several of the possible corrective mea- 
sures such as differential scoring clearly depend upon knowing with which 
of these types one is dealing. In short our typology serves to highlight 
some of the problems in assessing bias in addition to describing how it 
may be that a test is biased. 

Let us point out here that bias of types B, C, D, E can easily lead 
one group to obtain consistently lower scores than the other; this is only 
a possibility, not an inevitable consequence.* Consider again the example 
of a reading comprehension test which might fit Type B. Let it consist of 
questions about reading passages presented in the test. Let us assume that 
for middle class suburban white fifth grade children (group I) the set of 
questions on the passages produces a highly reliable measure with 10% of 
the variance among scores due to error, 30% due to differential prior knowledge 
of the content, 10% due to word knowledge (they all know most of the 
words), and the remaining 50% to reading comprehension skill per se. The 
same instrument for poor inner city black children (group II) might be 10% 
error, 5% prior knowledge (perhaps none of them know much about the content), 
40% vocabulary with the remaining 45% being reading comprehension. 
The members of the black group would then uniformly score relatively low 
because of poor background knowledge and many would have scores relatively 
low because of poor vocabulary as well. The effect of both variables is a 
lower average score for the blacks. The first factor, prior knowledge, 
contributes little variance to black scores because of uniform lack of in- 
formation while the second factor, vocabulary, contributes little variance 
to white scores because of uniform knowledge. Clearly an interpretation 
of the score as an assessment of status in reading comprehension is doubly 
unfair to blacks given these conditions. 

As a matter of fact we do know that most academic tests, both aptitude 
and achievement, yield consistently higher scores for one set of groups in 
society in contrast to various other groups such as poor people, blacks, 
and Chicanos (Coleman, 1966). Some people overgeneralize these results 
to indicate that the latter groups are inferior to the former. In so 
doing they are assuming the tests are fair and unbiased (other inappropriate 
assumptions are often made as well); if this assumption is false their conclusions 
become opinions without any logical basis. Yet it has not been 
customary practice to examine tests for bias. Obviously one cannot examine 
all tests for possible bias against all kinds of groups. But given the situation 
just described it seems painfully clear that a systematic examination 
for cultural bias of the major published ability and achievement tests now 
in use in our schools is long overdue. 

* Systematic between-group differences in the observed scores need not be 
found when there is bias; in such a case the apparent equality of performance 
is misleading and would not be found if the bias were eliminated. 

Simple, readily applied procedures are needed. What we are reporting 
here are some explorations of ways to proceed toward that end. So far we 
have tried five sorts of approaches, none of them definitive. They are, 
(1) the point biserial approach, (2) group by score category interactions 
for mean item difficulties, (3) the adjusted item difficulty approach, (4) 
the estimated item characteristic approach, and (5) the intergroup factor 
approach. The first approach was developed in a study previously reported 
(Green 1972). Since much of our data came from the source used in that 
study and since we wish to include a report of an attempt to verify the 
conclusions reached then we will describe the sample and procedures used 
there first. 

THE POINT BISERIAL APPROACH: INITIAL STUDY 

This study compares the results of using three disadvantaged minority 
groups — northern, urban black; southern, rural black; and southwestern 
Mexican-American — as tryout samples in contrast to white, advantaged groups 
in the same regions. In an item tryout a set of items is administered to 
a sample of the relevant population and the results are then examined item 
by item in an effort to pick the more effective items. 

Would an item tryout using these different groups lead to the selection 
of different items from the item pool? If so: 



(1) Do the different items selected measure different things? 

(2) Are the resulting item sets "better" for the minority groups? 

(3) Will the relative discrepancy in scores favoring majority groups 
be reduced by using a minority tryout group? 

Method 

The data were derived from those obtained during the standardization 
of the California Achievement Tests, 1970 Edition (CAT-70), published by 
CTB/McGraw-Hill. The CAT-70 is a general achievement battery with five 
overlapping levels, four of which were used. The standardization took 
place early in 1970 and involved over 200,000 students in about 400 schools. 
The items in the battery came from a variety of sources, but it is fair to 
say that they were written by and for "middle America." The tryout samples 
also fit this description. Thus, the tests should favor white middle-class 
Americans if they favor any group. 

All schools participating in the CAT-70 standardization answered questionnaires 
which provided information on the basic character of the population 
served. From the data on these questionnaires, seven groups of 
schools were drawn for this study. The characteristics and sizes of these 
groups are shown in Table 2. The samples used in this study are drawn from 
schools serving pupils highly homogeneous with respect to ethnic background 
and rather homogeneous with respect to socioeconomic status. 

Data Analyses 

The basic procedure used for examining the data was an item selection 
routine using the point biserial correlation for each item as the criterion. 
Each of the seven groups was treated as a tryout sample with the items in 
each test functioning as an item pool. For each group on each test at each 
grade, the "best" half of the items (i.e., those with the highest item-test 
correlations) were noted. Four kinds of analyses were made. 

(1) The number and percent of items chosen for one group in the pair 
but not for the other was recorded. These items will be called "biased." 
The number of these biased items in any one comparison suggests the degree 
to which the two groups interact with the test items in a distinct manner. 

(2) Scores for each group in a pair were obtained on both sets of 
biased items. These two tests may be called the "majority biased test" 
and the "minority biased test" since they contain the items uniquely best 
for the respective groups. The correlation between each group's score on 
the two tests was found and estimates of the variance not common to the two 
were made to judge how different the sets of items really are in what they 
measure. 

(3) Another analysis consisted of examining and comparing KR 20 reli- 
ability estimates. 

(4) Finally, mean scores were examined for changes in the relative 
status of the groups as a result of item selection. 
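
The first of these analyses can be sketched as follows. This is a minimal illustration and not the original program: it assumes 0/1 item scores, correlates each item with the total score on the full test (item included), takes the top half of the items for each group, and reports the items chosen for one group but not the other. The function names and the tie-breaking rule (the lower item index wins) are assumptions.

```python
from math import sqrt


def point_biserial(item, totals):
    """Correlation between a 0/1 item score and the total test score."""
    n = len(item)
    mean_i = sum(item) / n
    mean_t = sum(totals) / n
    cov = sum((x - mean_i) * (t - mean_t) for x, t in zip(item, totals)) / n
    var_i = sum((x - mean_i) ** 2 for x in item) / n
    var_t = sum((t - mean_t) ** 2 for t in totals) / n
    if var_i == 0 or var_t == 0:
        return 0.0  # item or total has no variance; treat as uninformative
    return cov / sqrt(var_i * var_t)


def best_half(responses):
    """Indices of the half of the items with the highest point biserials."""
    k = len(responses[0])
    totals = [sum(r) for r in responses]
    r_pb = [point_biserial([r[j] for r in responses], totals) for j in range(k)]
    ranked = sorted(range(k), key=lambda j: r_pb[j], reverse=True)
    return set(ranked[: k // 2])


def biased_items(group_a, group_b):
    """Items selected for one group only: (only for a, only for b)."""
    best_a, best_b = best_half(group_a), best_half(group_b)
    return best_a - best_b, best_b - best_a
```

Because both groups contribute the same number of selected items, the two returned sets are always the same size; their size divided by the number of items selected gives the proportion of "biased" items for the pair.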

Results 

Proportions of biased items. The medians of the proportions of biased 
items among those selected are shown in Table 3 for all possible pairs of 
groups. The overall median proportion was approximately .30. The white, 
middle-class groups appear more like each other (these pairs had lower 
medians) than they are like the minority groups. The latter also had more 
in common than they shared with the three majority groups. 

Independence of biased item tests. All groups differed from their 
pairs to some degree by the criterion of proportion of biased items, and 
some of the differences appear to be substantial. However, it is possible 
that these sets of biased items still measure much the same thing. To 
examine this possibility, scores for each individual were obtained on both 
biased item tests. This was possible since each individual answered all 
items. The correlations between these two scores were obtained for each 
group on each test. These correlations varied from -.17 to +.82 with a 
median of about .5 (.55 for group I and .46 for group II) which leaves a 
lot of variance unaccounted for. Since the number of biased items was 
very small in many cases, the reliabilities of the biased tests are typically 
low; thus the median after correction for attenuation is near .8 
(.84 and .77 respectively; range -.30 to +1.00). But even allowing for 
this, it appears that in many instances the majority and minority tests 
measure somewhat different things and as a rule do so for both groups 
involved. 
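
For readers unfamiliar with it, the correction for attenuation divides the observed correlation by the square root of the product of the two reliabilities. The sketch below states the standard formula; the function name and the numbers in the example are hypothetical and are not taken from our tables.

```python
from math import sqrt


def correct_for_attenuation(r_xy, rel_x, rel_y):
    """Estimated correlation between true scores on two tests.

    r_xy  : observed correlation between the two tests
    rel_x : reliability of the first test (e.g., its KR 20)
    rel_y : reliability of the second test
    """
    if rel_x <= 0 or rel_y <= 0:
        raise ValueError("reliabilities must be positive")
    return r_xy / sqrt(rel_x * rel_y)


# Hypothetical values: an observed correlation of .50 between two short
# tests with reliabilities of .60 and .65.
corrected = correct_for_attenuation(0.50, 0.60, 0.65)
```

With very short tests the reliabilities are small and poorly estimated, so the corrected value can be numerically larger than 1 and should be read with caution.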

Reliability. As noted earlier one case of bias occurs if the test 
scores of one group contain substantially more error than they do for another 
group. The overall median KR 20's on the full tests for groups I 
through VII were .91, .91, .91, .92, .93, .90, and .92, respectively. Obviously, 
there is little evidence of bias by this criterion, although a test-by-test 
comparison of these reliabilities shows that the figures are higher 
for the majority group more often than not (97 of 162 comparisons). The 
data concerning reliabilities after item selection also show a very small 
amount of bias so defined. These results do not preclude the possibility 
that other kinds of reliability estimates might show more bias of this sort 
but they do not make it seem likely. 

Changes in test scores. Another way to look at bias is to assert that 
the scores of some groups are unfairly low because the test does not ade- 
quately measure all the relevant abilities or knowledge, and, in particular, 
does not measure well those relevant attributes on which the group in ques- 
tion happens to score well. If the item pool contains items which measure 
these attributes at all, a selection routine using this group might be 
expected to increase the importance of these attributes in determining the 
total score, thereby reducing the disadvantage of the group. Therefore, 
the three minority groups considered here might be expected to do relatively 
better on the items selected as best for them than they did on the original 
test. Each group's improvement on each of the nine tests in the battery 
was compared to the improvement shown by its comparison group. The minority 
groups showed greater relative improvement consistently in the upper grades, 
but not in grades 1 and 3. As was the case for proportions of biased items, 
the southern, rural, white group does not fit the pattern: the item selection 
procedure helped them as often as it helped the rural blacks, perhaps 
because their initial scores were more alike to begin with, especially in 
the lower grades. 

The argument of the preceding paragraph would appear to have even 
greater force for the biased item tests. 

The majority biased item tests (note this is the set of items best for 
the majority) are almost uniformly more difficult for both groups than are 
the minority biased item tests. The differences between majority group 
mean scores and minority group mean scores are usually smaller on the minority 
biased item tests than on the majority biased item tests. Table 4 shows 
the frequencies of this phenomenon. In most cases the relative advantage of 
the majority groups was reduced when using items chosen as best for the minority 
group but was increased when using items chosen as best for themselves. 

In short, each analysis indicated bias, apparently small in amount, 
but clearly suggesting that ordinary item selection procedures may be pro- 
ducing biased tests. One such study hardly proves the point but it does 
give credibility to the possibility. To examine this possibility more 
closely we set out to both confirm the results of the initial study and 
then to look at other procedures for assessing bias in achievement tests. 

THE POINT BISERIAL APPROACH: SUBSEQUENT ANALYSES 

First we wish to report on our efforts to verify the outcomes of the 
original study. To do this the data were examined in several ways. The 
principal analyses were intended to provide a look at the stability of the 
data and to yield some cross validations. This was accomplished by redoing 
portions of the study for random halves of each group and applying that outcome 
to the other halves. 

Thus one-half of the grade 5 northern white group (I) was selected 
randomly as was one-half of the grade 5 northern black group (II) making 
four groups. We will call the first half of group I and the first half of group 
II the criterion halves and the remaining half of each group the cross validation 
halves. Then using the reading comprehension test, the point biserials 
for the first half of group I and again for the first half of group 
II were found and the "best" items (those with the highest point biserials 
for this half of the group) were selected. As before the "biased" items 
were then determined. This procedure was repeated 100 times for the grade 
5 reading comprehension test using the two northern groups, I and II, so 
that the stability of various statistics could be observed; the other half 
of each group was used as cross validation data. The same kinds of analyses 
were done for five other grade-test combinations: the grades 3, 5, and 
8 reading comprehension test for groups III and IV; the grade 3 computation 
test and the grade 8 language mechanics test for groups VI and VII. 
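
The repeated random-halves design can be outlined in code. The sketch below is only a schematic of the resampling loop under our own assumptions: the function names, the seed argument, and the idea of passing the analysis in as a function (here called statistic) are illustrative and do not describe the program actually used.

```python
import random
from statistics import mean, pstdev


def split_half(group, rng):
    """Randomly split a list of examinees into two halves."""
    idx = list(range(len(group)))
    rng.shuffle(idx)
    half = len(idx) // 2
    criterion = [group[i] for i in idx[:half]]
    cross_validation = [group[i] for i in idx[half:]]
    return criterion, cross_validation


def repeated_split_trials(group_a, group_b, statistic, trials=100, seed=0):
    """Apply `statistic` to fresh random halves of both groups on each trial.

    `statistic` receives (criterion_a, criterion_b, cross_a, cross_b) and
    may return any value, e.g. the number of "biased" items found.
    """
    rng = random.Random(seed)  # fixed seed so a run can be repeated exactly
    results = []
    for _ in range(trials):
        crit_a, cross_a = split_half(group_a, rng)
        crit_b, cross_b = split_half(group_b, rng)
        results.append(statistic(crit_a, crit_b, cross_a, cross_b))
    return results


def summarize(values):
    """Mean and standard deviation of a numeric statistic over the trials."""
    return mean(values), pstdev(values)
```

Items would be selected on the criterion halves only; the cross validation halves are then scored on those item sets, so that any statistic computed on them is free of the capitalization on chance that affects the criterion halves.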

Proportion of biased items. The first statistic above in need of examination 
is the proportion of "biased" items summarized in Table 3 which 
shows the median proportion of biased items for the various pairs of groups. 
Table 5 adds detail to this picture confirming that groups I, III, and VII, 
the majority groups, are similar to each other, as are the minority groups II, 
IV, and VI, when contrasted to the nine possible minority-majority pairs. 
The consistency of this result is very strong since it applies without 
exception to the comparisons of the medians based on all tests within 
each grade as well as those based on all grades for each kind of subtest. 
By this criterion it seems clear that less test bias against a group can 
ordinarily be expected if the items were chosen using data based on a 
similar group. 

The stability of the proportion of biased items statistic can also 
be seen in Table 6 which shows the means and standard deviations over the 
100 trials for each of the six tests. The variability seems quite small 
for most of them. 

Another indication of stability is the frequency with which particular 
items were chosen as biased. Table 7 shows how many times for both the 
group I and II and the group III and IV comparisons each of the 42 reading 
comprehension items in the grade 5 test were chosen as biased in favor of 
whites, in favor of blacks, for both or neither. Thirty-six of the 42 
items were categorized in the initial study exactly the same way as they 
were most of the time for the random halves of group I and II while 40 of 
42 were the same for the group III and IV comparison. Clearly most items 
are consistently assigned; contradictory assignment such as being chosen 
for both groups and also being rejected for both are rare. Furthermore, 
particular items tend to be categorized the same in both the group I vs. 
II and group III vs. IV comparisons. In fact the point biserials for groups 
I and III correlate .61; the figure for II and IV is .85. The corresponding 
item difficulty correlations are .98 and .94. In short both the number of 
biased items and which items they are tend to be stable because the items 
have similar characteristics in similar groups. 



Table 6 

Number of Biased Items Found in Follow-up Studies 

                                           Initial Study    100 Repeated Trials 
Groups     Grade  Test                     No. of    (%)    Mean    (%)    S.D. 
                                           Items 

I & II       5    Reading Comprehension      10     (48)     8.9   (43)    1.4 
III & IV     3    Reading Comprehension       6     (26)     8.3   (36)    1.3 
III & IV     5    Reading Comprehension       9     (43)     9.3   (44)    1.3 
III & IV     8    Reading Comprehension       9     (39)     7.1   (32)    1.5 
VI & VII     3    Arithmetic Computation      9     (25)    12.2   (34)    1.8 
VI & VII     8    Language Mechanics         14     (39)    10.7   (30)    1.8 

Independence of biased item tests. The data from the initial study 
indicated that the biased item sets favoring each group usually measured 
different things. The new analysis permits cross validation of this result 
since the halves of the groups not used in choosing the items also 
obtained scores on these sets of items. This cross validation was done 
for the grade 5 Reading Comprehension Test in groups I and II. In the initial 
study the correlations between the two sets of the biased items were 
.55 and .36 for groups I and II, respectively. The median correlations for 
the 100 criterion halves were .54 and .35, respectively. For the blacks 
the two tests measure substantially different things. The medians for the 
cross validation halves were .57 and .40. Since the sizes of these coefficients 
are about what was obtained initially, the correction for attenuation 
should also yield about the same results. The cross validation correlations 
do tend to be slightly higher for both groups indicating a somewhat lesser 
tendency for the two tests to measure different things but the differences 
are not sufficient to alter the interpretation that the two sets of items 
tend to measure rather different things in both groups. Again the results 
of the initial study are confirmed. 

Changes in test scores. The final matter to be verified is the phenomenon 
considered in Table 4, the advantage to a group in mean score relative 
to the other group of having the test consist of items chosen as best 
for them. In Table 8 the relevant data from the initial study are compared 
to the corresponding means for the criterion halves and the 100 cross validation 
halves. The outcome of the initial study is fully supported for the 
first and last of the six tests considered but either unsupported or con- 
tradicted by the data for the other four. From these results we conclude 
that biased item tests as we have defined them do not necessarily yield 
relatively higher or lower scores than do other item sets for any group. 
On the other hand, in some cases, as illustrated by the Language Mechanics 
test, a pronounced advantage does occur. Note that a .08 mean difference in 
item difficulty is the equivalent of a six point change in score, which is 
about sixteen percentile points around the median. 

This rehash of the procedures and data of the initial study still leaves 
the interpretations somewhat ambiguous. The point biserial approach appears 
to show some bias in some CAT tests against minority groups but in very small 
amounts in all but a couple of instances. Since the items examined were all 
preselected on the basis of data from a single "standard" (i.e., heterogeneous) 
tryout sample, it is quite possible that these data produce an underestimation 
of the amount of bias. Furthermore, it is plain that the interpretation 
of differences in point biserials as bias is not in itself unambiguous 
unless one can adequately account for the role of difficulty in these 
differences. In some instances items have low point biserials because of 
floor or ceiling effects. However, examination of the distributions of item 
difficulties obtained before and after selection based on point biserials 
does not show much change although extremely easy and extremely difficult 
items tend to be eliminated. The distributions vary from excellent to ter- 
rible for the tests and groups considered, but these distributions do not 
seem to be directly related to any conclusions drawn about bias so far. We 
will consider the many questions that arise about difficulty again later in 
the paper since some of what follows throws light on the matter. 

In any case it is obvious that using differential item-test correlations 
as the criterion of bias is not the only reasonable approach to assessing bias 
in an achievement test and so the next step is to consider other approaches. 

GROUP BY SCORE LEVEL INTERACTIONS 

A customary way of looking at item analysis data is to divide the tryout 
group into fourths or fifths, based on quartiles or quintiles, respectively, 
and examine the proportion of the cases making each possible 
response in each of these levels. Given different tryout groups such data 
can be examined for interaction by a chi square test. This can be done 
for each possible response. Black and standard groups were used in item 
tryouts for a number of tests now under construction at CTB. Chi square 
tests for interactions between black and white tryout group response patterns 
were undertaken for these data and for the group I and group II grade 
5 data from the initial study. Table 9 gives the information obtained on 
two versions of an item meant for a first grade oral usage test. The first 
version of the item is as follows: "Look at the first picture. This girl 
can draw. Now look at the second picture. Listen to this sentence. This 
is the picture she drawed. I will say it again. This is the picture she 
drawed. Mark your answer." The second version is identical except that the 
sentence reads "This is the picture she has drawn." The first version produces 
a significant interaction while the second does not, largely because 
it does not function well in either group. Figures 1 and 2 illustrate the 
distribution by fifths. 
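
One way such a test might be set up is sketched below. The exact layout of the tables tested is not spelled out here, so the sketch rests on an assumption: for a single response, the examinees choosing it are counted by tryout group (rows) and by score fifth (columns), and an ordinary Pearson chi square is computed on that table. The function name and the counts are hypothetical; the counts are not the Table 9 data.

```python
def chi_square(table):
    """Pearson chi square and degrees of freedom for a table of counts.

    `table` is a list of rows; every row must have the same number of cells.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs at least one case")
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df


# Hypothetical counts of examinees choosing one response, listed from the
# highest score fifth to the lowest, for two tryout groups.
standard_group = [40, 35, 20, 5, 2]
black_group = [20, 10, 8, 6, 6]
stat, df = chi_square([standard_group, black_group])
```

With two groups and five score levels the table has 4 degrees of freedom, for which the tabled .05 critical value is 9.49. A significant result under this layout says only that the response is distributed differently over score levels in the two groups; it does not by itself say why.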

Table 10 indicates the frequency with which interactions were found
for various tests. The reason for the high frequency of "biased" items
in the CAT grade 5 tests and the Science Skills tests is not apparent,
although one can easily imagine why phonological discrimination and oral
usage items appear to be discriminatory.

The one characteristic the science tests and the CAT tests share that
the others do not is that they have a uniformly large white-black difference
in item difficulty. Very few of the CAT items have group by fifth plots that
cross and few of them look much like either of the figures given. Instead
they tend to look like that for item 16 (see Figure 15) of the grade 5 reading
comprehension test. The point biserials are almost the same, .39 and .36,
but the chi square is 11.7. Clearly difficulty is functioning differently
Table 9

Item Analysis Data on Two Versions of an Oral
Usage Item for Standard and Black Tryout Groups

Form X Oral Usage, Item 17, Grade 1.2

                      STANDARD             BLACK
                      A        B*          A        B*
Percent              .53      .45         .69      .26
High 5th             .12      .88         .39      .61
Mid 5th              .23      .77         .78      .22
Mid 5th              .54      .44         .80      .17
Mid 5th              .88      .09         .81      .14
Low 5th              .89      .05         .70      .13
Choice N             126      106         124       46
Test Mean           27.1     38.2        26.8     32.1

Item Statistics
Difficulty           .447                 .256
Point Biserial       .657                 .325
Biserial             .827                 .443
Select               .008                 .005

Summary Data
N                     237                  180
Mean                31.96                27.58
KR 20                 .87                  .80
S.D.                 8.09                 7.01

Chi square = 41.8

Form Y Oral Usage, Item 17, Grade 1.2

                      STANDARD             BLACK
                      A*       B           A*       B
Percent              .79      .18         .79      .16
High 5th             .77      .23         .85      .15
Mid 5th              .67      .33         .87      .07
Mid 5th              .81      .16         .80      .17
Mid 5th              .90      .10         .90      .10
Low 5th              .80      .11         .61      .24
Choice N             201       47         120       24
Test Mean           30.6     33.4        28.8     26.8

Item Statistics
Difficulty           .788                 .795
Point Biserial       .080                 .171
Biserial             .112                 .242
Select               .004                 .001

Summary Data
N                     255                  151
Mean                30.71                27.95
KR 20                 .85                  .80
S.D.                 7.77                 6.97

Chi square = 7.6
If the product $A_c' A_c$ is temporarily defined as a diagonal matrix
$\Lambda_c$ by requiring that the columns of $A_c$ and $F_{ci}$ are
orthogonal, we have

$$Y_i' Y_j = F_{ci} \Lambda_c F_{cj}'.$$

Then the expression for the symmetric matrix

$$Y_i' Y_j Y_j' Y_i = F_{ci} \Lambda_c F_{cj}' F_{cj} \Lambda_c F_{ci}' = F_{ci} \Lambda_c^2 F_{ci}'$$

is in the form of an easily soluble eigenvalue-eigenvector problem.

Then we may solve for $F_{cj}$ from the expression

$$F_{cj}' = \Lambda_c^{-1} F_{ci}' Y_i' Y_j,$$

since

$$Y_i' Y_j = F_{ci} \Lambda_c F_{cj}',$$

$$F_{ci}' Y_i' Y_j = \Lambda_c F_{cj}',$$

and

$$\Lambda_c^{-1} F_{ci}' Y_i' Y_j = F_{cj}'.$$

Given the $Y_i$, the $F_{ci}$, and $\Lambda_c$ we may solve for $A_c$ by the
expression in adjoined matrices below, in which a power of $\Lambda_c$ that
is illegible in the source follows the product:

$$[\, Y_i : Y_j \,] \begin{bmatrix} F_{ci} \\ F_{cj} \end{bmatrix} \Lambda_c^{[\text{illegible}]} = A_c.$$

Now the expressions

$$(Y_i - A_c F_{ci}')' (Y_i - A_c F_{ci}') = F_i \Lambda_i F_i', \quad \text{where } \Lambda_i = A_i' A_i,$$

$$(Y_i - A_c F_{ci}') = A_i F_i', \quad \text{and} \quad (Y_i - A_c F_{ci}') F_i = A_i$$

allow us to solve for $A_i$ and $F_i$, which completes the estimation of
parameters.
Table 10

Proportion of Items in Various Tests
Showing Group by Score Interactions

                                          Number        Interactions
                                         of Items
Test Name                       Grade    Examined    Number   Proportion

PRIMARY READING
Letter Names                      1          5          0        0
Letter Sounds                     1         37          9       .24
Letter Sounds                     2         12          1       .08
Visual Discrimination             1         10          1       .10
Visual Discrimination             2          5          0        0
Listening Comprehension           1         20          6       .30
Totals                                      89         17       .19

READING
Word Reading                      2         34          8       .24
Phonic Errors                     2         30          4       .13
Reading Comprehension             2         30          6       .20
Reading Comprehension             3         65         17       .26
Reading Vocabulary                3         39         15       .38
Totals                                     198         50       .25

LANGUAGE
Phonological Discrimination       1         52         28       .54
Oral Usage                        1         92         35       .38
Oral Language                     2         24          4       .17
Punctuation & Capitalization      3         34         16       .47
Language Usage                    3         24          4       .17
Spelling                          3         34          3       .09
Totals                                     260         90       .35

SCIENCE
Science                           3         40         31       .78
Science                           6         43         34       .79
Science                           8         18         13       .72
Science                          10         75         45       .60
Science                          12         52         36       .69
Totals                                     228        159       .70

SOCIAL SCIENCE
Social Science                    3         49         18       .37
Social Science                    6         76         22       .29
Social Science                    8         28          6       .21
Social Science                   10         77         16       .21
Totals                                     230         62       .27

MATHEMATICS
Mathematics                       1         42          8       .19
Math Computation                  2         58          7       .12
Math Computation                  3         58         17       .29
Math Concepts & Applications      2         40          7       .18
Math Concepts & Applications      3         62         10       .16
Totals                                     260         49       .19

C.A.T.
Reading Vocabulary                5         33         27       .82
Reading Comprehension             5         39         32       .82
Math Computation                  5         63         52       .82
Math Concepts & Problems          5         36         27       .75
Totals                                     171        138       .81
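This method may be generalized to the case where there are m subgroups
of interest. The overall model is then

$$[\, Y_1 : Y_2 : \cdots : Y_m \,] = A_c \, [\, F_{c1}' : F_{c2}' : \cdots : F_{cm}' \,] + [\, A_1 : A_2 : \cdots : A_m \,] \begin{bmatrix} F_1' & 0 & \cdots & 0 \\ 0 & F_2' & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & F_m' \end{bmatrix} + [\, E_1 : E_2 : \cdots : E_m \,].$$

The $F_{ci}$ may be estimated from the recursive set of equations

$$Y_1' Y_2 Y_2' Y_3 Y_3' \cdots Y_m Y_m' Y_1 = F_{c1} \Lambda_c^{m} F_{c1}',$$

$$\Lambda_c^{-(m-1)} F_{c1}' Y_1' Y_2 Y_2' \cdots Y_{m-1} Y_{m-1}' Y_m = F_{cm}',$$

$$\Lambda_c^{-(m-2)} F_{c1}' Y_1' Y_2 Y_2' \cdots Y_{m-2} Y_{m-2}' Y_{m-1} = F_{c,m-1}',$$

$$\vdots$$

$$\Lambda_c^{-1} F_{c1}' Y_1' Y_2 = F_{c2}';$$

then, with a power of $\Lambda_c$ that is illegible in the source,

$$[\, Y_1 : Y_2 : \cdots : Y_m \,] \begin{bmatrix} F_{c1} \\ F_{c2} \\ \vdots \\ F_{cm} \end{bmatrix} \Lambda_c^{[\text{illegible}]} = A_c,$$

and for each $Y_i - A_c F_{ci}'$, $A_i$ and $F_i$ may be estimated as before.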
Figure 15. Mean Item Difficulties in Total Score Fifths (Low to High) for
Groups I and II on Several Items of the Grade 5 Reading
Comprehension Test.



Results of the Application of the 
Inter-Group Analysis 

Computer programming problems precluded having an inter-group analysis
for three or more groups available at this time. We do have, however, the
results of two different two-group analyses. One analysis was done for the
previously defined groups I and II as a black-white inter-group analysis; the
other analysis involved the two white groups I and III so as to establish
a benchmark for the interpretation of the white-black inter-group results.

The first of these two analyses was based on the data from 270 black
and 360 white fifth grade students on the 42 items in the CAT-70 reading
comprehension test. For the other analysis the group I data were put
together with data on the same test obtained from 396 fifth grade students
in group III. For both analyses the first stage of parameter estimation was
to estimate the eigenvalues of the common inter-group space. In both cases
plots of the eigenvalues indicate that the common space should be considered
to consist of three dimensions; in both cases three dimensions accounted
for almost 80% of the common space variance.

In the next step of parameter estimation, three columns of each of the
matrices pertaining to sources of variation common to the groups were
estimated. This was done for each of the two analyses, and then the
eigenvalues of the group specific spaces were estimated. In general it
appeared that there were three dimensions to each group specific space as
well.

The third step was to estimate three columns of each group specific
matrix and then to determine the various proportions of variance of the
total space which could be explained by each source. Tables 11 and 12 contain
the proportions of test variance and Tables 13 and 14 contain frequencies of
items which fall into classes according to the proportion of item variance
attributable to the group specific source.



in the two approaches. Let us, therefore, consider difficulty more directly. 

ADJUSTED DIFFERENCES IN ITEM DIFFICULTIES 

Angoff has examined a way of looking at plots of item difficulties so
as to locate aberrant items in a subtest and thus to examine them for
unfairness and exclusion. As a modification of this procedure, it is suggested
that the item-test biserial correlations be incorporated in the procedure
so as to estimate linear test score-item score regression whereby adjusted
item difficulties may be formed in a manner analogous to the way in which
adjusted means are formed in an Analysis of Covariance. Such a procedure
would allow the effect of differential item-test correlations as well as
differential item difficulty to influence the location of aberrant items.
For example, if such a procedure were employed, an item which would be
deviant in terms of item difficulties and which would have a low item-test
correlation for one or both groups would show up as relatively more deviant
than a similar item with a high item-test correlation. It may be argued that
an item showing more aberrance in adjusted item difficulty is indeed more
deviant than one could have inferred from the unadjusted plot.

Such an argument rests upon the contention that given two items with 
equal differences in difficulty for two groups, the item which is more 
strongly related to the test score is more likely to be reflecting group 
differences in the behaviors the test accesses than is likely for the item 
which is less strongly related to the test score. 
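
One way such an adjustment might be computed is sketched below. It is not the authors' procedure, whose details are not given here: it assumes that each group's own least-squares slope of item score (0 or 1) on total test score is used, and that each group's item difficulty is adjusted to the pooled mean test score, as an adjusted mean is formed in the analysis of covariance. All function names and data are hypothetical.

```python
from statistics import mean


def regression_slope(item_scores, total_scores):
    """Least-squares slope of the 0/1 item score on the total test score."""
    mean_total = mean(total_scores)
    mean_item = mean(item_scores)
    sxx = sum((t - mean_total) ** 2 for t in total_scores)
    sxy = sum((t - mean_total) * (y - mean_item)
              for t, y in zip(total_scores, item_scores))
    return sxy / sxx  # undefined when every total score is identical


def adjusted_difficulties(groups):
    """Adjust each group's proportion correct to the pooled mean test score.

    groups maps a group name to (item_scores, total_scores).
    Assumption: group-specific slopes are used, so that differential
    item-test relationships affect the adjusted values.
    """
    pooled = [t for _, totals in groups.values() for t in totals]
    grand_mean = mean(pooled)
    adjusted = {}
    for name, (item_scores, total_scores) in groups.items():
        slope = regression_slope(item_scores, total_scores)
        adjusted[name] = mean(item_scores) - slope * (mean(total_scores) - grand_mean)
    return adjusted


# Hypothetical tryout data: equal raw difficulties, unequal mean test scores.
example = {
    "standard": ([0, 0, 1, 1], [3, 4, 5, 6]),
    "black": ([0, 0, 1, 1], [1, 2, 3, 4]),
}
print(adjusted_difficulties(example))
```

In this made-up example both groups have a raw difficulty of .50, yet the adjusted values differ because the groups differ in mean test score. Plotting adjusted rather than raw difficulties for each item would then give the second kind of cross plot described in the text.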

Both adjusted and ordinary item difficulties were calculated from a
set of primary level item tryout data obtained from black and standard
tryout samples. Figures 3 and 4 show the relationship between the two
plotting procedures. Note that the two do not produce the same ordering
of the items as to aberrance.
Table 11

Proportions of Variance for the Group I-II Comparison

                              Group
Source                      I       II
Common Intergroup          67       60
Group Specific             11       14
Residual and Error         22       26


Table 12

Proportions of Variance for the Group I-III Comparison

                              Group
Source                      I       III
Common Intergroup          70       70
Group Specific              9        8
Residual and Error         21       22

Figure 3. Cross Plots of Item Difficulties on a Phonological Discrimination Test for 
Grade 1 Standard and Black Tryout Groups. 
Table 13

Frequencies of Items within Categories of
Group Specific Variance Accounted for by Groups I and II

Proportion of Group               Group
Specific Variance               I       II
  0  < X <  .05                17       16
 .05 < X <  .10                10       10
 .10 < X <  .15                 7        2
 .15 < X <  .20                 2        7
 .20 < X <  .25                 3        2
 .25 < X <  .50                 3        4
 .50 < X < 1.00                 0        1


Figure 4. Cross Plots of Adjusted Item Difficulties on a Phonological Discrimination
Test for the Grade 1 Standard and Black Tryout Groups Shown in Figure 3.
Table 14

Frequencies of Items within Categories of
Group Specific Variance Accounted for by Groups I and III

Proportion of Group               Group
Specific Variance               I       III
  0  < X <  .05                17       16
 .05 < X <  .10                12       13
 .10 < X <  .15                 4        5
 .15 < X <  .20                 4        2
 .20 < X <  .25                 2        4
 .25 < X <  .50                 3        2
 .50 < X < 1.00                 0        0



ESTIMATION OF ITEM CHARACTERISTIC CURVES 

Another way of looking for items which may be in some sense unfair is
to estimate and plot item characteristic curves separately for each group
and to compare the plots. Item characteristic curves are essentially
representations of nonlinear regressions of the probability of correct
response on the latent trait which a test attempts to measure. If the test
score is taken as a reasonable estimate of the trait, we may estimate the
regression of the probability of correct response to an item on the test
score by means of a higher order polynomial and plot the polynomial function
as our estimate of the characteristic curve. The plots of characteristic
curves obtained separately for groups of interest may then be superimposed
and inspected for possible group by item interaction.
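
The polynomial regression just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the program used for the figures: the degree of the polynomial is not given in the text, so a cubic is assumed as the default; scores are rescaled to the interval 0 to 1 before fitting to keep the powers well behaved; and the function names and data are hypothetical. One curve would be fitted per group and the curves plotted together.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients c[0] + c[1]*x + ... + c[degree]*x**degree."""
    n = degree + 1
    # Normal equations (X'X) c = X'y, where X holds the powers of x.
    xtx = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    xty = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(xtx[r][col]))
        xtx[col], xtx[pivot] = xtx[pivot], xtx[col]
        xty[col], xty[pivot] = xty[pivot], xty[col]
        for row in range(col + 1, n):
            factor = xtx[row][col] / xtx[col][col]
            for k in range(col, n):
                xtx[row][k] -= factor * xtx[col][k]
            xty[row] -= factor * xty[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(xtx[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (xty[i] - tail) / xtx[i][i]
    return coeffs


def estimated_curve(item_scores, total_scores, max_score, degree=3):
    """Return a function giving the fitted probability of success at a score."""
    xs = [t / max_score for t in total_scores]  # rescale scores to 0..1
    coeffs = fit_polynomial(xs, item_scores, degree)
    return lambda score: sum(c * (score / max_score) ** i
                             for i, c in enumerate(coeffs))


# Hypothetical data for one group: total scores 0..10 and 0/1 item scores.
totals = list(range(11))
items = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
curve = estimated_curve(items, totals, max_score=10)
for score in (2, 5, 8):
    print(score, round(curve(score), 3))
```

A polynomial fitted this way is not constrained to stay between 0 and 1 or to be monotonic, which is consistent with the irregular curves discussed below; a large sample in each group is needed before the shape can be trusted.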

One weakness of this approach is that it requires a large number of
subjects in each group in order to achieve estimates of any quality. If
sufficient data can be obtained, however, the procedure provides a graphic
representation which is easily inspected and which provides detail beyond
the distribution by fifths. Our own experience has been that a few items
which appeared to be acceptable when their distributions by fifths were
examined had estimated item characteristic curves which indicated that they
were less than desirable. Figure 5 contains the most egregious example of
all the estimated item characteristic curves which we plotted. Note that
there is a group by item interaction and that the curves are not monotonic,
or constantly increasing, as is desirable of such curves. We have a
hypothesis for the behavior of the curves in Figure 5 with respect to a
reading comprehension item. The information required to answer the item is
in the second sentence of a paragraph. The "topic sentence" read without the
rest of the paragraph would lead one to select one of the incorrect foils. It
is hypothesized that those students who scored in the lower half of their
An examination of Tables 11 and 12 leads to an estimate of approximately
5% of group specific variance in the Group II model, beyond the benchmark 9
or 10% that could be expected from very similar groups. For the test overall
this would seem to be neither an absolution of nor an indictment for
unfairness. Tables 13 and 14 indicate much the same on the item level. If
one were to arbitrarily establish 25% as an undue amount of group specific
item variance then there are 9 unfair items out of 42, since some of the
same items are unfair for more than one group. Of the nine, two had greater
than 50% of their item variance attributable to residual or error and are
thus just plain bad items. Excluding those items there are still more items,
4, indicated as unfair for Group II than for any other group. One of those
items indicated as unfair by the Group II analysis is item 22, for which you
have seen the plot of its estimated item characteristic curve (Figure 5).

Figures 8 through 11 are the Group I and II plots of estimated item
characteristic curves for the items which were indicated unfair by the Group
II analyses, and Figure 12 is the plot of one of the items indicated unfair
by the Group I analysis.

Consider what the exclusion of the items plotted in Figures 8 through
12 would do to total scores. The items in Figures 9 and 10 show a clear
separation between the Group I and the Group II curves over the major
portion of their score range, with the curve for Group I above the curve for
Group II. Thus it is clear that the exclusion of the two items plotted in
Figures 9 and 10 would result in an increase in the probability of a higher
relative score (relative to an overall mean score) for almost all of the
individuals in Group II. Figure 11, on the other hand, shows a separation
only in the lower score range, and thus the exclusion of this item would
increase the probability of a higher relative score for only those members
of Group II who score in that lower range. The exclusion of the item plotted
in Figure
Figure 8. The Estimated Item Characteristic Curve for Item Number 5.
groups tended to read the passage slowly but completely and therefore were
not led astray, while those scoring a bit higher overall tended to scan
the passage too rapidly and were misled. However, the highest scoring
students read quickly yet assimilated the entire passage and thus were not
fooled. Lest you get the wrong impression, Figure 6 contains a more typical
pair of plots and Figure 7 contains the best of 42 pairs of plots, where
"best" means the plots which appear most like the standard conception of a
good ogive.

INTERGROUP FACTOR ANALYTIC APPROACHES 

In this section, a type of factor or component analysis will be outlined 
which it is suggested may be useful in the examination of achievement tests 
for bias. In order to see why any type of factor analytic method would be ap- 
propriate, consider the nature of an achievement test. From an achievement 
test score one infers the location of a student in the domain of the test by 
means of a conglomerate statistic based on the evaluations of a series of re- 
sponses to items, items which may be considered stimulus aggregates. Each 
separate response evaluation is itself an inference based on assumptions about 
1) the conceived behavioral domain of the test, 2) the relevance of the item 
to that domain for the subject who responds, and 3) the mutual understanding 
of respondent and response evaluator about the general rules of response eval- 
uation. 

If the domain of a test is such that only one type of achievement be- 
havior is accessed by subjects in responding to the items, it is likely 
that the items will each relate in differing amount to the domain for a 
subject. The evaluations of item-domain relationships can be conceived of 
as forming a pattern. If two groups have different patterns of item- 
domain relations, it is possible that the domains of behavior accessed by 
Figure 6. The Estimated Item Characteristic Curve for Item Number 37.

Figure 10. The Estimated Item Characteristic Curve for Item Number 15.

Figure 11. The Estimated Item Characteristic Curve for Item Number 16.



the two groups in responding to the test are not the same. Factor analytic 
techniques may be employed to estimate item-factor relational patterns for 
groups and these patterns may be compared. 

A more complicated situation arises when the a priori domain of a test
is not unitary or simple. In this case, the domain may be conceived of as
a set of unitary or simple subdomains. Therefore, a set of item-subdomain
patterns for each group being considered, rather than a single pattern, must
be compared. As before, factor analytic methods may be employed to estimate
such patterns.

Factor analysis attempts to estimate parameters on the right of the
regression expression

$$y = A f + e,$$

where only the vector of subject response evaluations $y$ is "known", and
where $A$ is a matrix of sets of known variable to factor variable
relationship patterns, $f$ is a vector of factor variables, and $e$ is a
vector of residual and error terms. If we allow the factor variables $f$ to
represent locations in the subdomains and $y$ to be the item response
evaluations, then $A$ will be the set of relational patterns to be compared.
But suppose that for the individuals in some otherwise identifiable
subgroup, the pattern of known variables to factor variables is not
precisely the same as for all other subgroups. Let us then formulate a
general model which expresses a relationship between known variables and two
types of factor variables: 1) those that are common to all subgroups, and 2)
those that are unique to a given subgroup. The following regression
expression represents just such a relationship:

$$y = A_c f_c + A_u f_u + e,$$

where $y$ and $e$ are defined as before and where the subscripts c and u
indicate for the patterns and factor variables those that are common to all
12 would seem to produce no significant change in total score.

Figure 8 concerns an item which provides an interesting example of 
how the different methods we have discussed can lead by themselves to 
different conclusions. The figure suggests that perhaps there is a 
group X score interaction but not a strong one. The test for inter- 
actions is significant when the quintiles are established for the groups 
separately, but the test is not significant when the groups are pooled to 
determine the quintiles for both groups. The difference in difficulty is 
relatively large, but the difference in adjusted item difficulties is near 
zero. The point biserial approach indicated that the item was fair (see 
item 5, Table 7), yet the inter-group factor approach indicated that a 
large proportion of item variance within Group II is due to group specific 
sources. Thus there are some seeming contradictions. Some clarification
results when it is realized that an item indicated as unrelated to a test's 
common inter-group factors may be a good item for another test. Perhaps 
an inspection of Figures 13 and 14 will clarify the situation further. 
Note how differently the quintiles would be formed if groups were pooled 
and then look again at Figure 8 and note how the different ways of forming 
the five subgroups can cause the groups to either appear or not appear to 
be interacting with score. Note also that if both groups had the same score 
distribution as Group II there would be no group difference in the
overall difficulty on this item. Further note that given the distributions 
in Figures 13 and 14 the group difference in item difficulty occurs because 
the distribution for Group I is concentrated under the area of the curves 
where the Group I curve is higher but that the concentration of the Group 
II curve occurs under the area where the curves cross. 

Finally, note that there is a distinction between a conception of 
fairness of this item for a group and a conception of its fairness for an 
subgroups with c and those that are specific to a particular subgroup with
u. Note that this general model does not preclude either of the products
$A_c f_c$ or $A_u f_u$ from forming a vector of all zeros; that is, either
the common or unique part may be nonexistent. If on the other hand both
unique and common parts do exist, the model can provide a measure of the
overall fairness of a subtest by determining the proportion of variance
accounted for by the common part, and it can provide a means of identifying
items which may be in a sense unfair. For just as we may determine the
proportion of subtest variance accounted for by the common part of the model,
we may determine for each item the proportion of variance accounted for by
the subgroup specific part of the model. If the amount of item variance
accounted for by subgroup specific sources is large, then that item is
probably unfair.
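
The item-level proportions can be illustrated with a short sketch. It assumes, beyond what the text states, that the factor variables are uncorrelated and have unit variance, so that the variance an item owes to a source is the sum of squared entries in that item's row of the corresponding pattern matrix. The function names, the pattern values, and the use of a .25 cutoff as the default are illustrative only.

```python
def variance_proportions(common_row, specific_row, item_variance):
    """Split one item's variance among common, specific, and residual sources.

    Assumption: uncorrelated unit-variance factors, so each source's
    contribution is a sum of squared pattern coefficients.
    """
    common = sum(a * a for a in common_row)
    specific = sum(a * a for a in specific_row)
    residual = item_variance - common - specific
    return {
        "common": common / item_variance,
        "specific": specific / item_variance,
        "residual": residual / item_variance,
    }


def flag_items(common_pattern, specific_pattern, item_variances, threshold=0.25):
    """Return 1-based numbers of items whose specific proportion exceeds threshold."""
    flagged = []
    rows = zip(common_pattern, specific_pattern, item_variances)
    for number, (c_row, s_row, variance) in enumerate(rows, start=1):
        if variance_proportions(c_row, s_row, variance)["specific"] > threshold:
            flagged.append(number)
    return flagged


# Hypothetical patterns for three standardized items (variance 1.0 each).
common_pattern = [[0.8, 0.0], [0.3, 0.3], [0.5, 0.4]]
specific_pattern = [[0.6], [0.1], [0.2]]
print(flag_items(common_pattern, specific_pattern, [1.0, 1.0, 1.0]))
```

The same sums taken over all items, divided by the total test variance, would give test-level proportions of the kind reported for the whole test.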

This conception of fairness rests on the assumption that if there is a
large part of the variance accounted for in either a test score or an item
score which is not due to sources specific to a particular subgroup, but is
due to sources common to all subgroups of interest, then that test or item
is probably fair.

The model which attributes test variance to common factors for all
groups, to specific factors for groups, and to item specific and residual
sources is based on the idea of inter-battery factor analysis offered by
Tucker (1958). In the inter-battery model, variance is partitioned into
factors common to test batteries, factors specific to batteries, and test
specific and residual variance. The inter-battery model requires each
subject to take all test batteries. The inter-group factor model presented in
this paper requires that all groups take the same test.

The estimation of model parameters in the inter-battery model rests 
on the assumption that only the factors common to batteries are involved in 
Figure 12. The Estimated Item Characteristic Curve for Item Number 28.



between battery cross products or covariances. Analogously, the inter-group
model's parameter estimation is based on the assumption that only the
factors common to groups are involved in between group cross products or
covariances. In the handout, there is an outline of the procedure for
estimating the parameters of an intergroup model which is for only two groups,
followed by a demonstration of the extension of the procedure for the
estimation of parameters for a model for m groups.

Method of determination of model parameter estimates. The model for
a subgroup, as opposed to the model for an individual in a subgroup, may
be written as

$$Y_i = A_c F_{ci}' + A_i F_i' + E_i,$$

where $Y_i$ is a matrix of item response evaluations with p rows, one for
each of the p items, and $n_i$ columns, one for each of the $n_i$ subjects in
subgroup i; $A_c$ is the common pattern matrix, which is p by k, the number
of common factor variables; $F_{ci}'$ is the matrix of common factor scores,
which is k by $n_i$; $A_i$ is the matrix of subgroup specific patterns and is
p by q, the number of specific factor variables; $F_i'$ is the matrix of
specific factor scores, q by $n_i$; and $E_i$ is the matrix of error and
residuals. The conception of $A_i$ as subgroup specific allows the
definition of $A_i$ as orthogonal to $A_c$ and to $A_j$, $i \neq j$, so that
$A_i' A_j = 0$, the zero matrix. Further note the definition $E_i' E_j = 0$.

Thus for the pair of subgroup evaluation matrices $Y_i$ and $Y_j$ we may
write the expression for the product $Y_i' Y_j$ as the identity

$$Y_i' Y_j = F_{ci} A_c' A_c F_{cj}' + F_{ci} A_c' A_j F_j' + F_i A_i' A_c F_{cj}' + F_i A_i' A_j F_j'.$$

But $A_c' A_j$, $A_i' A_c$, and $A_i' A_j$ are all equal to zero matrices.
Thus,

$$Y_i' Y_j = F_{ci} A_c' A_c F_{cj}'.$$
Figure 13. Frequency Distribution of Raw Scores for Group I on the Grade 5
Reading Comprehension Test.

Figure 14. Frequency Distribution of Raw Scores for Group II on the Grade 5
Reading Comprehension Test.



individual in a group. For while the unconditioned probability of correct 
response to the item is greater for Group I, the probability conditioned 
on a score of 15 is not higher for either group. Thus if you were a fifth 
grader of either group who was likely to score a 15 on the test, the item 
may be more fair for you than for others. 
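
The distinction between the unconditioned and the conditioned probability can be made concrete with a small sketch. All numbers are hypothetical and are not read from Figures 8, 13, or 14; the two conditional curves are simply made to agree at a score of 15, and the two groups are given different score distributions.

```python
def overall_difficulty(score_counts, p_correct_given_score):
    """Unconditioned proportion correct: the conditional probabilities
    averaged over the group's own score distribution."""
    n = sum(score_counts.values())
    return sum(count * p_correct_given_score[score]
               for score, count in score_counts.items()) / n


# Hypothetical conditional probabilities of success at three score points.
curve_group_1 = {10: 0.3, 15: 0.5, 20: 0.8}
curve_group_2 = {10: 0.2, 15: 0.5, 20: 0.7}

# Hypothetical score distributions: Group I concentrated at high scores,
# Group II concentrated at low scores.
scores_group_1 = {10: 10, 15: 30, 20: 60}
scores_group_2 = {10: 60, 15: 30, 20: 10}

p1 = overall_difficulty(scores_group_1, curve_group_1)
p2 = overall_difficulty(scores_group_2, curve_group_2)
print(f"unconditioned: Group I {p1:.2f}, Group II {p2:.2f}")
print("conditioned on a score of 15:", curve_group_1[15], curve_group_2[15])
```

In this made-up case the unconditioned difficulties differ widely while the probabilities conditioned on a score of 15 are identical, which is the situation the paragraph above describes.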

ITEM DIFFICULTY AND BIAS 

Do the preceding statements mean that item difficulty is the heart of 
test bias? Not by our definition although plainly the notion of an unusual 
difference in difficulty between groups is a useful indication that an item 
may be biased against one of the groups. Also it is clear that items that 
are too easy or too hard for only one of a pair of groups cannot measure the 
same thing in equal amounts in both groups. Thus some items and some tests 
are biased only because they are inappropriately difficult or easy for one 
or the other group. The effect is a set of scores too high or too low and 
can be classified as bias of Type A. Nevertheless the item will not prove 
to be biased for members of the group for whom it is not too hard or easy 
because it is measuring the appropriate trait. Thus if extreme difficulty 
is the only factor involved the bias is merely inappropriate use. Also 
note that it is entirely possible for bias of Type A to occur among items 
of just the right difficulty for a group. 

Difficulty enters in some fashion into all approaches to assessing bias. 
Point biserials are necessarily low for really extreme difficulty values but 
one would ordinarily reject such items anyway. More to the point is that 
items that are really too difficult do not measure what they are meant to 
because too many people are responding to aspects of them not meant to be 
determining elements. Accordingly such items have low point biserials. 
This sort of thing happens often to at least a few people on almost
any item. The distributions by fifths demonstrate this well. Figure 15
shows some of the reading comprehension items, and it can be seen that ceiling
and floor effects for the top or bottom of one or the other group are common
and perhaps explain the high frequency of significant interactions. The
two highest chi square values for reading comprehension are for items 20 and 
26 which demonstrate just such effects. They are, nevertheless, good items 
by other criteria. The interaction approach appears to be unduly influenced 
by difficulty and will lead to faulty conclusions about bias when the groups 
compared differ appreciably in mean score. In such a case items are fre- 
quently too easy or too hard for some substantial portion of one or both 
groups. To have it otherwise is not easily accomplished and it is not al- 
ways desirable to restrict the range that a test can cover. 

Still, it is our experience that at least for some topics at some levels 
it is possible to reduce the differences in difficulty between groups similar 
to I and II without any apparent decrease in content validity. With the 
exception of the science tests, we think we have accomplished this substan- 
tially for the tests listed in Table 7 although proof of success awaits 
standardization.* 

Finally we would like to draw attention to the contrast between the 
item characteristic curves of Figures 5-12 and the mean difficulties of the 
fifths displayed in Figure 15. This contrast emphasizes how the distribu- 
tion of the scores of a particular group may conceal the characteristics of 
the item for that group over the full range of scores. In the item charac- 
teristic curves the role of relative difficulties is displayed at each score 

* No attempt was made with the CAT tests because the tryout data were not 
available. 
point. Under these circumstances the relationship between differential
conditional difficulties becomes a criterion of bias. However, if one must rely
on a statistic, the point biserials are more likely to suggest what the
characteristic curves would be like than any other statistic, because they
represent a linear approximation of such curves.
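
The link between the point biserial and a straight-line approximation to the curve can be shown with the standard formulas. The sketch below uses the usual definition of the point biserial (with the population standard deviation of the total scores) and the ordinary relation between a correlation and a least-squares slope; the function names and data are hypothetical.

```python
from math import sqrt


def point_biserial(item_scores, total_scores):
    """Point biserial correlation between a 0/1 item and the total score."""
    n = len(item_scores)
    passed = [t for y, t in zip(item_scores, total_scores) if y == 1]
    failed = [t for y, t in zip(item_scores, total_scores) if y == 0]
    if not passed or not failed:
        raise ValueError("the item needs both right and wrong answers")
    p = len(passed) / n
    mean_all = sum(total_scores) / n
    sd_total = sqrt(sum((t - mean_all) ** 2 for t in total_scores) / n)
    mean_pass = sum(passed) / len(passed)
    mean_fail = sum(failed) / len(failed)
    return (mean_pass - mean_fail) / sd_total * sqrt(p * (1 - p))


def linear_slope(item_scores, total_scores):
    """Slope of the straight line predicting the item score from the total
    score: the correlation times sd(item) divided by sd(total)."""
    n = len(item_scores)
    p = sum(item_scores) / n
    mean_all = sum(total_scores) / n
    sd_total = sqrt(sum((t - mean_all) ** 2 for t in total_scores) / n)
    return point_biserial(item_scores, total_scores) * sqrt(p * (1 - p)) / sd_total


# Hypothetical data for one group.
items = [0, 0, 1, 1]
totals = [1, 2, 3, 4]
print(round(point_biserial(items, totals), 3), round(linear_slope(items, totals), 3))
```

Two groups with similar point biserials but different difficulties would thus have roughly parallel straight-line approximations at different heights, which a comparison of point biserials alone would not reveal.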

CONCLUSIONS 

It may be apparent that we have some preferences among the approaches 
examined for determining item bias and test bias. We believe use of item 
characteristic curves of the sort described here and intergroup factor analy- 
sis will permit test builders to build both fairer and more generally effec- 
tive tests. However, we find merit and value in all approaches since they 
each provide some relevant information and none of them are completely redun- 
dant. We will, however, continue to look for better ways to proceed since 
the efficiency of such an eclectic approach is rather obviously low. 

In the meantime, we would like to make six points we believe our data 
support. 

1) Bias against various groups in achievement tests occurs but it 
may well be small and unimportant in amount for most groups; we simply can- 
not say from our data and procedures, nor do we know of any other data that 
can answer that question adequately.

2) Item bias and test bias are not quite one and the same thing.
Thus a demonstration that a test has some biased items does not necessarily
prove that the test score overall is biased since some items may balance
others.

3) Nevertheless finding biased items and fixing or eliminating them
appears to be important at least until one can find ways to demonstrate that
the amount of bias is unimportant. Furthermore biased items are often bad
items generally.

4) Nor is group bias identical with bias against all members of a 
group, for the first can exist in the absence of the second. 

5) Most of the ambiguities in determining bias that we have noted 
stem from the lack of appropriate external criteria. Work of the sort 
being reported here by Caylor and by Williams should be emulated by all.
External criteria ought to be found for all tests even if their relevance 
is indirect. 

6) Thus we are asking for a reconsideration of the construction and 
validation procedures used with achievement tests. The internal character- 
istics of a test along with armchair decisions about content validity (how- 
ever expert the judges may be) do not provide an adequate basis for judging 
validity. Content validity procedures probably obviate bias only for the 
item writers. 

The issue of test bias tends to arouse the emotions of a number of peo- 
ple. Perhaps as a consequence of this concern and attention people will be 
willing to undertake and support efforts to carry out the full program of 
research any published test should have. 